## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
There are 4898 observations and 13 variables, but X is just a sequential index for each observation. There are 11 chemical variables:
1 - fixed acidity (tartaric acid - g / dm^3)
2 - volatile acidity (acetic acid - g / dm^3)
3 - citric acid (g / dm^3)
4 - residual sugar (g / dm^3)
5 - chlorides (sodium chloride - g / dm^3)
6 - free sulfur dioxide (mg / dm^3)
7 - total sulfur dioxide (mg / dm^3)
8 - density (g / cm^3)
9 - pH
10 - sulphates (potassium sulphate - g / dm^3)
11 - alcohol (% by volume)
And 1 quality variable:
12 - quality (score between 0 and 10)
As the table above shows, R treats the ‘quality’ variable as an integer, but in my opinion it should be interpreted as ordinal, because it ranks the wines from worst to best. So I am going to transform it in the data frame:
## Ord.factor w/ 7 levels "3"<"4"<"5"<"6"<..: 4 4 4 4 4 4 4 4 4 4 ...
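The conversion itself is not shown in this report. A minimal sketch of one way to do it in base R follows, assuming the ordered copy is stored in a column called `qualityCat` (the name that appears in the summary) and using a tiny made-up data frame in place of the real one:

```r
# Toy stand-in for the wine data frame (made-up scores; the real data has 4898 rows).
wines <- data.frame(quality = c(6L, 5L, 7L, 3L, 9L, 6L, 8L, 4L))

# Keep the integer column and add an ordered-factor copy of it.
wines$qualityCat <- factor(wines$quality,
                           levels = sort(unique(wines$quality)),
                           ordered = TRUE)

str(wines$qualityCat)

# Because the factor is ordered, comparisons between levels are allowed.
n_good <- sum(wines$qualityCat >= "7")
n_good
```

Keeping the original integer column alongside the factor is convenient because numeric summaries and correlations still need the integer version.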
Statistical summary of the data:
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.: 1.700
## Median : 6.800 Median :0.2600 Median :0.3200 Median : 5.200
## Mean : 6.855 Mean :0.2782 Mean :0.3342 Mean : 6.391
## 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900 3rd Qu.: 9.900
## Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :65.800
##
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.00900 Min. : 2.00 Min. : 9.0
## 1st Qu.:0.03600 1st Qu.: 23.00 1st Qu.:108.0
## Median :0.04300 Median : 34.00 Median :134.0
## Mean :0.04577 Mean : 35.31 Mean :138.4
## 3rd Qu.:0.05000 3rd Qu.: 46.00 3rd Qu.:167.0
## Max. :0.34600 Max. :289.00 Max. :440.0
##
## density pH sulphates alcohol
## Min. :0.9871 Min. :2.720 Min. :0.2200 Min. : 8.00
## 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100 1st Qu.: 9.50
## Median :0.9937 Median :3.180 Median :0.4700 Median :10.40
## Mean :0.9940 Mean :3.188 Mean :0.4898 Mean :10.51
## 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500 3rd Qu.:11.40
## Max. :1.0390 Max. :3.820 Max. :1.0800 Max. :14.20
##
## quality qualityCat
## Min. :3.000 3: 20
## 1st Qu.:5.000 4: 163
## Median :6.000 5:1457
## Mean :5.878 6:2198
## 3rd Qu.:6.000 7: 880
## Max. :9.000 8: 175
## 9: 5
It can be seen that no wine has the best quality (10) or the worst (0). The majority of the wines are in categories 5 and 6 (1457 and 2198 of 4898). So, with this dataset it is going to be difficult to draw conclusions about what gives a wine the best or the worst quality.
The first variables are related to acidity, so we are going to start by plotting them.
These four parameters look roughly normally distributed, but in all four cases there is some positive skew: there are few observations at the higher x-axis values.
So I am going to plot these variables again, excluding the top 1% of values.
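Excluding the top 1% can be done by cutting at the 99th percentile. The sketch below is an illustration only: it uses synthetic right-skewed values rather than a real column of the dataset, and the names `cutoff` and `x_trimmed` are hypothetical.

```r
set.seed(1)
# Synthetic right-skewed values standing in for one of the acidity columns.
x <- rlnorm(1000, meanlog = -1.3, sdlog = 0.3)

# 99th percentile of the values; anything above it is left out of the plot.
cutoff <- quantile(x, probs = 0.99)
x_trimmed <- x[x <= cutoff]

hist(x_trimmed, breaks = 30, main = "Top 1% excluded", xlab = "value")
```

Only the plotted values are filtered here; the underlying data stays intact, so later statistics still use every observation.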
Now the normal shape of these distributions is clearer. However, ‘citric.acid’, for example, shows some isolated peaks.
Now, let’s plot the other concentration-related variables:
Again, all the variables look roughly normally distributed, but it seems better to exclude the top 1%.
Excluding the top 1%, residual.sugar appears to be log-normally distributed.
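A common way to check for a log-normal shape is to look at the histogram of the logged values, which should then be roughly symmetric. The following is only a sketch on synthetic log-normal data standing in for residual.sugar; the parameters are made up.

```r
set.seed(2)
# Synthetic log-normal values standing in for residual.sugar (made-up parameters).
sugar <- rlnorm(2000, meanlog = 1.5, sdlog = 0.8)

# Take logs and plot; a log-normal variable looks roughly symmetric on this scale.
log_sugar <- log10(sugar)
hist(log_sugar, breaks = 30, main = "log10(residual sugar)", xlab = "log10(g / dm^3)")

# The gap between mean and median shrinks a lot once the values are logged.
gap_raw <- mean(sugar) - median(sugar)
gap_log <- mean(log_sugar) - median(log_sugar)
c(raw = gap_raw, logged = gap_log)
```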
There is a bimodal distribution, with one population centered around low values and another around high values.
Let’s plot the other variables:
Quality is normally distributed, with the majority of wines in the middle bins. Density also looks normal with some positive skew. On the other hand, alcohol looks multimodal.
Let’s see density and alcohol without the top 1%:
Density is normally distributed, but alcohol looks trimodal, with low, medium and high alcohol content populations.
I am going to create a new variable that I think is interesting: residual.sugar / alcohol.
The peak at low values of sugar / alcohol is interesting.
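For reference, the ratio is a single column operation. The sketch below uses the first four rows shown in the structure listing at the top, and the column name `sugar_alcohol` as it appears in the correlation output; the small data frame itself is just a stand-in.

```r
# First four observations from the structure listing, as a stand-in data frame.
wines <- data.frame(
  residual.sugar = c(20.7, 1.6, 6.9, 8.5),
  alcohol        = c(8.8, 9.5, 10.1, 9.9)
)

# Ratio of residual sugar (g / dm^3) to alcohol (% by volume).
wines$sugar_alcohol <- wines$residual.sugar / wines$alcohol
round(wines$sugar_alcohol, 3)
```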
The dataset has 4898 observations and 12 variables, not counting the index X.
Of these, 11 are chemical characteristics of the wine and one is the quality, a way to classify the wines from bad to good. No wine has a quality of 10 or of 0, and the majority of the wines have medium quality.
I created a new variable to see the ratio residual.sugar / alcohol.
I also transformed the quality variable into a categorical (ordered factor) type.
First, let’s see the correlation among the different characteristics of the wine.
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity 1.00000000 -0.02269729 0.289180698
## volatile.acidity -0.02269729 1.00000000 -0.149471811
## citric.acid 0.28918070 -0.14947181 1.000000000
## residual.sugar 0.08902070 0.06428606 0.094211624
## chlorides 0.02308564 0.07051157 0.114364448
## free.sulfur.dioxide -0.04939586 -0.09701194 0.094077221
## total.sulfur.dioxide 0.09106976 0.08926050 0.121130798
## density 0.26533101 0.02711385 0.149502571
## pH -0.42585829 -0.03191537 -0.163748211
## sulphates -0.01714299 -0.03572815 0.062330940
## alcohol -0.12088112 0.06771794 -0.075728730
## quality -0.11366283 -0.19472297 -0.009209091
## sugar_alcohol 0.09363299 0.04575732 0.102730408
## residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity 0.08902070 0.02308564 -0.0493958591
## volatile.acidity 0.06428606 0.07051157 -0.0970119393
## citric.acid 0.09421162 0.11436445 0.0940772210
## residual.sugar 1.00000000 0.08868454 0.2990983537
## chlorides 0.08868454 1.00000000 0.1013923521
## free.sulfur.dioxide 0.29909835 0.10139235 1.0000000000
## total.sulfur.dioxide 0.40143931 0.19891030 0.6155009650
## density 0.83896645 0.25721132 0.2942104109
## pH -0.19413345 -0.09043946 -0.0006177961
## sulphates -0.02666437 0.01676288 0.0592172458
## alcohol -0.45063122 -0.36018871 -0.2501039415
## quality -0.09757683 -0.20993441 0.0081580671
## sugar_alcohol 0.99001187 0.11932114 0.3143238443
## total.sulfur.dioxide density pH
## fixed.acidity 0.091069756 0.26533101 -0.4258582910
## volatile.acidity 0.089260504 0.02711385 -0.0319153683
## citric.acid 0.121130798 0.14950257 -0.1637482114
## residual.sugar 0.401439311 0.83896645 -0.1941334540
## chlorides 0.198910300 0.25721132 -0.0904394560
## free.sulfur.dioxide 0.615500965 0.29421041 -0.0006177961
## total.sulfur.dioxide 1.000000000 0.52988132 0.0023209718
## density 0.529881324 1.00000000 -0.0935914935
## pH 0.002320972 -0.09359149 1.0000000000
## sulphates 0.134562367 0.07449315 0.1559514973
## alcohol -0.448892102 -0.78013762 0.1214320987
## quality -0.174737218 -0.30712331 0.0994272457
## sugar_alcohol 0.429487399 0.87168339 -0.2013195265
## sulphates alcohol quality sugar_alcohol
## fixed.acidity -0.01714299 -0.12088112 -0.113662831 0.09363299
## volatile.acidity -0.03572815 0.06771794 -0.194722969 0.04575732
## citric.acid 0.06233094 -0.07572873 -0.009209091 0.10273041
## residual.sugar -0.02666437 -0.45063122 -0.097576829 0.99001187
## chlorides 0.01676288 -0.36018871 -0.209934411 0.11932114
## free.sulfur.dioxide 0.05921725 -0.25010394 0.008158067 0.31432384
## total.sulfur.dioxide 0.13456237 -0.44889210 -0.174737218 0.42948740
## density 0.07449315 -0.78013762 -0.307123313 0.87168339
## pH 0.15595150 0.12143210 0.099427246 -0.20131953
## sulphates 1.00000000 -0.01743277 0.053677877 -0.01803066
## alcohol -0.01743277 1.00000000 0.435574715 -0.53683146
## quality 0.05367788 0.43557472 1.000000000 -0.13475048
## sugar_alcohol -0.01803066 -0.53683146 -0.134750485 1.00000000
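A matrix like the one above comes from `cor()` applied to the numeric columns. Reading the strongest pairs off a 13 x 13 table is tedious, so the sketch below also ranks the pairs by absolute correlation. It runs on a small synthetic data frame with made-up coefficients, not on the real data, and `top_pairs` is a hypothetical name.

```r
set.seed(3)
n <- 200
# Synthetic stand-in data (made-up parameters, not the real wine measurements).
wines <- data.frame(
  alcohol        = rnorm(n, mean = 10.5, sd = 1.2),
  residual.sugar = rlnorm(n, meanlog = 1.5, sdlog = 0.8)
)
# Density constructed to fall with alcohol and rise with sugar, plus noise.
wines$density <- 0.994 - 0.001 * (wines$alcohol - 10.5) +
  0.0004 * (wines$residual.sugar - 6) + rnorm(n, sd = 0.0005)

cors <- cor(wines)  # Pearson correlation by default

# Keep each pair once (upper triangle), then sort by absolute correlation.
idx <- which(upper.tri(cors), arr.ind = TRUE)
top_pairs <- data.frame(
  var1 = rownames(cors)[idx[, "row"]],
  var2 = colnames(cors)[idx[, "col"]],
  r    = cors[upper.tri(cors)]
)
top_pairs <- top_pairs[order(-abs(top_pairs$r)), ]
top_pairs
```

When applied to the real data, a derived column such as sugar_alcohol will dominate the top of such a ranking, because it is computed directly from residual.sugar.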
Let’s plot the pairs of variables with the highest correlation:
There is a clear positive correlation between residual.sugar and density (0.839): when sugar content increases, density does too.
Between density and alcohol, the correlation is negative (-0.780).
Density and sugar_alcohol have a strong positive correlation (0.872).
Let’s explore the correlation between quality and some parameters, because it would be good to see whether some characteristics determine if a wine is good or not.
There is a slight negative correlation.
From quality 5 onward, the median alcohol content starts to increase. So it looks like wines of better quality tend to have more alcohol.
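The medians behind a boxplot by quality can also be computed directly, which makes the trend easier to read. This is a sketch with made-up alcohol values for a handful of wines, not figures from the dataset.

```r
# Made-up alcohol values for a few wines per quality score (illustration only).
wines <- data.frame(
  quality = c(4, 4, 5, 5, 5, 6, 6, 7, 7, 8),
  alcohol = c(10.1, 10.5, 9.4, 9.6, 9.8, 10.2, 10.8, 11.2, 11.6, 12.0)
)

# Median alcohol content within each quality score.
med <- tapply(wines$alcohol, wines$quality, median)
med

# The same grouping as a boxplot, one box per quality score.
boxplot(alcohol ~ quality, data = wines,
        xlab = "quality", ylab = "alcohol (% by volume)")
```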
It is not easy to draw conclusions about this relationship because the median values move up and down.
It can be observed that wines of higher quality have lower density.
It looks like wines of higher quality have lower chloride values, but the decrease is very slight. Also, there are lots of outliers among wines of quality 5 and 6.
It looks like density and chlorides decrease in better wines, while alcohol increases.
The strongest relationship is between density and the sugar/alcohol ratio.
First, as there is a strong relation between density and alcohol, let’s plot it together with the quality.
It can be seen that higher quality wines tend to have high alcohol levels and low density.
For a given sugar / alcohol value, better wines appear to have lower density than worse ones.
In general, better wines have lower density for a given sugar value.
There is no clear relation here.
Density looks to be an important feature in wines, as does the level of alcohol. The relation between sugar and density also seems important in determining the quality of a wine.
In this graph we see the clear positive correlation between sugar/alcohol and density. Better wines are below the trend line, but there are more dark spots in the area of lower sugar/alcohol levels.
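The observation that better wines sit below the trend line could be checked numerically by fitting the line and looking at the sign of the residuals per quality score. The sketch below only illustrates the mechanics: the data are synthetic, with the effect built in through made-up coefficients, so it says nothing about the real dataset.

```r
set.seed(4)
n <- 300
# Synthetic data: ratio and quality are drawn at random (made-up ranges).
wines <- data.frame(
  sugar_alcohol = runif(n, min = 0.1, max = 2),
  quality       = sample(4:8, n, replace = TRUE)
)
# Made-up relationship: density rises with the ratio and is shifted down
# for higher quality scores.
wines$density <- 0.990 + 0.004 * wines$sugar_alcohol -
  0.0003 * (wines$quality - 6) + rnorm(n, sd = 0.0005)

# Straight trend line of density against the ratio.
fit <- lm(density ~ sugar_alcohol, data = wines)
wines$resid <- residuals(fit)

# Share of wines lying below the line, for each quality score.
below <- tapply(wines$resid < 0, wines$quality, mean)
below
```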
In my opinion this plot is interesting because it reflects the idea that better wines have more alcohol. It is curious that among the bad wines the better ones have less alcohol, but from medium quality onward the median alcohol content starts to increase.
This plot is interesting because it shows that better wines tend to have a higher alcohol content and lower density. It is curious that there are some dark spots in the left area of the plot, so there are some good wines with little alcohol but high density.
This dataset has good information for getting an idea of what makes a wine bad or good, but it contains many variables, some of which are likely related (such as the acidity ones), so it is difficult to draw conclusions.
The strongest positive correlation is between density and the sugar/alcohol ratio (0.87), while the strongest negative correlation is between density and alcohol (-0.78).
For me, the most interesting insight is that better wines, in general, have a high alcohol content and less sugar, so the best wines are not so sweet.
With more time, it could be analysed whether some combinations of three or more features make a wine special. There are a lot of wines of medium quality, but only 5 with quality 9 and none with 10. It would be good to see why.